ÁRBOL visualization tutorial
Author: Kyle Kimler
Compiled: Feb 3, 2023
ARBOL visualization methods enable users to track numerical and categorical data in a tree-like structure. In this tutorial we will go over the 3 types of data that we can "allot" and "propagate" in our tree graphs, and some ways we can extend them to many other different types of data - from normalized feature sets to QC variables that might clue us in to identify the species in our tree
Installation of ARBOL is currently done via install_github()
#devtools::install_github('jo-m-lab/ARBOL',force=T)
suppressPackageStartupMessages({
#Matrix.utils should be installed automatically when you install ARBOL.
#you may have to restart your R kernel to reset the package after initial installation
#now that Matrix.utils has dropped off of CRAN, you should load it separately to ensure proper installation
library(Matrix.utils)
library(ARBOL)
library(tidyverse)
})
Warning message in system("timedatectl", intern = TRUE):
“running command 'timedatectl' had status 1”
finally, load a seurat object for ARBOL. This tutorial uses the following model dataset:
srobj <- readRDS('srobjs/polyp_srobj.rds')
#Oftentimes you have to replace Seurat metadata rownames after using tidyverse functions
#tidyverse throws them out because they don't like rownames. Seurat, unfortunately, uses them.
#https://hbctraining.github.io/Intro-to-R/lessons/08_intro_tidyverse.html
srobj@meta.data$CellID <- row.names(srobj@meta.data)
tiers <- ARBOL::LoadTiersAsDF('ARBOL_output/polyp_r_sctransform/endclusts')
srobj@meta.data <- srobj@meta.data %>% left_join(tiers)
row.names(srobj@meta.data) <- srobj@meta.data$CellID
Warning message: “Expected 10 pieces. Missing pieces filled with `NA` in 19196 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].” Joining with `by = join_by(CellID)`
For ARBOL downstream analysis, you must provide a "sample" and "tierNident" (tierNident is included in the tiers dataframe in the previous cell) column in the Seurat object metadata
srobj@meta.data$sample <- srobj@meta.data$orig.ident
Tree-building methods in ARBOL supports categorical, count, diversity, and totals data. These four types of data are propagated up into the internal nodes of the tree respectively by:
ARBOL is written to calculate this way because internal-node membership is not explicit metadata in the seurat object, and exists best as part of the tree object in the @misc slot
srobj <- subclusteringTree(srobj, categories = c('sample','polyp','subset'))
Warning message: “Expected 10 pieces. Missing pieces filled with `NA` in 18036 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].” Joining with `by = join_by(name)`
#now each node in the nodes section of the ggraph contains lists of unique categories -
srobj@misc$cluster_ggraph %>% activate(nodes) %>% data.frame %>% select(node=name,sample,polyp,subset) %>% head
| node | sample | polyp | subset | |
|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | |
| 1 | T0C0 | Polyp1TOT, Polyp2TOT, Polyp3TOT, Polyp4TOT, Polyp5TOT, Polyp6ATOT, Polyp6BTOT, Polyp7TOT, Polyp8TOT, Polyp9TOT, Polyp11TOT, Polyp12TOT | YES, NO | Endothelial, Apical, Ciliated, PlasmaCell, Basal, Glandular, Myeloid, Fibroblast, MastCell, TCell |
| 2 | T1C6 | Polyp1TOT, Polyp2TOT, Polyp3TOT, Polyp4TOT, Polyp5TOT, Polyp6ATOT, Polyp6BTOT, Polyp7TOT, Polyp8TOT, Polyp9TOT, Polyp11TOT, Polyp12TOT | YES, NO | Endothelial, Fibroblast, Basal, TCell |
| 3 | T1C6.T2C2 | Polyp1TOT, Polyp2TOT, Polyp3TOT, Polyp4TOT, Polyp5TOT, Polyp6ATOT, Polyp8TOT, Polyp9TOT, Polyp11TOT, Polyp12TOT | YES, NO | Endothelial, Fibroblast, TCell |
| 4 | T1C6.T2C2.T3C2 | Polyp1TOT, Polyp2TOT, Polyp3TOT, Polyp11TOT, Polyp12TOT | YES, NO | Endothelial |
| 5 | T1C6.T2C2.T3C0 | Polyp1TOT, Polyp2TOT, Polyp9TOT, Polyp11TOT, Polyp12TOT | YES | Endothelial, Fibroblast |
| 6 | T1C6.T2C2.T3C1 | Polyp1TOT, Polyp2TOT, Polyp6ATOT, Polyp8TOT, Polyp9TOT, Polyp11TOT, Polyp12TOT | YES, NO | Endothelial, TCell |
#metadata added as category also provide a "majority" column!
srobj@misc$cluster_ggraph %>% activate(nodes) %>% data.frame %>% select(node=name,polyp,polyp_majority) %>% head
| node | polyp | polyp_majority | |
|---|---|---|---|
| <chr> | <chr> | <chr> | |
| 1 | T0C0 | YES, NO | YES |
| 2 | T1C6 | YES, NO | YES |
| 3 | T1C6.T2C2 | YES, NO | YES |
| 4 | T1C6.T2C2.T3C2 | YES, NO | YES |
| 5 | T1C6.T2C2.T3C0 | YES | YES |
| 6 | T1C6.T2C2.T3C1 | YES, NO | YES |
srobj <- subclusteringTree(srobj, categories = c('sample','polyp','subset'),
counts = c('sample','polyp','subset'))
Warning message: “Expected 10 pieces. Missing pieces filled with `NA` in 18036 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].” Joining with `by = join_by(name)`
counts tree data is stored in separate columns per unique member of each count category. These columns are named e.g. "polyp_n_YES" and "polyp_n_NO" and are counts of the number of cells in each category in each node.
srobj@misc$cluster_ggraph %>% activate(nodes) %>% data.frame %>%
select(node=name,polyp_n_YES,polyp_n_NO) %>% head
| node | polyp_n_YES | polyp_n_NO | |
|---|---|---|---|
| <chr> | <dbl> | <dbl> | |
| 1 | T0C0 | 10589 | 7447 |
| 2 | T1C6 | 746 | 442 |
| 3 | T1C6.T2C2 | 306 | 13 |
| 4 | T1C6.T2C2.T3C2 | 57 | 1 |
| 5 | T1C6.T2C2.T3C0 | 109 | 0 |
| 6 | T1C6.T2C2.T3C1 | 92 | 2 |
The first use we had for this was to show pie charts at each node displaying the composition of the node per category. This can be done using the pieify_tree_plot() function as follows - but does not work on "circular" or "tree" formats
scaling_factor = 8
options(repr.plot.width=16)
treeplot <- ggraph(srobj@misc$cluster_ggraph, layout='dendrogram') +
geom_edge_diagonal2(aes(color=node.polyp_majority, y = y*scaling_factor)) + theme_void() +
expand_limits(y=-5) + scale_edge_color_brewer(palette='Dark2') +
new_scale('color') +
geom_node_text(aes(filter=leaf, label = subset_majority, color = subset_majority), nudge_y=-1,vjust=0.5,hjust=0,angle=270,size=3) +
geom_node_point(aes(label=name),size=0)
#you can add ggplot manipulations directly to the output of this function
pieify_tree_plot(ggraph_plot = treeplot, srobj = srobj, color_metadata = 'polyp',
mode = 'subclusters', scaling_factor = scaling_factor, radius = 1) +
scale_fill_brewer(palette='Dark2')
Warning message in geom_node_point(aes(label = name), size = 0):
“Ignoring unknown aesthetics: label”
Joining with `by = join_by(name)`
Since we only obtained an unequal amount of cells from each disease category - roughly 10.5k with polyp, and 7.5k without polyp, you might want to normalize the counts of the categories to the total for each. We can manipulate the ggraph counts directly to calculate a normalized composition
srobj@meta.data %>% count(polyp)
totalYES <- srobj@meta.data %>% count(polyp) %>% pull(n) %>% .[2]
totalNO <- srobj@meta.data %>% count(polyp) %>% pull(n) %>% .[1]
totalYES
totalNO
srobj@misc$cluster_ggraph <- srobj@misc$cluster_ggraph %>% activate(nodes) %>%
mutate(normalized_polyp_n_NO = polyp_n_NO / totalNO,
normalized_polyp_n_YES = polyp_n_YES / totalYES)
| polyp | n |
|---|---|
| <chr> | <int> |
| NO | 7447 |
| YES | 10589 |
scaling_factor = 8
options(repr.plot.width=16)
treeplot <- ggraph(srobj@misc$cluster_ggraph, layout='dendrogram') +
geom_edge_diagonal2(aes(color=node.polyp_majority, y = y*scaling_factor)) + theme_void() +
scale_edge_color_brewer(palette='Dark2') +
expand_limits(y=-5) +
new_scale('color') +
geom_node_text(aes(filter=leaf, label = subset_majority, color = subset_majority), nudge_y=-1,vjust=0.5,hjust=0,angle=270,size=3) +
geom_node_point(aes(label=name),size=0)
pieify_tree_plot(ggraph_plot = treeplot, srobj = srobj, color_metadata = 'normalized_polyp',
mode = 'subclusters', scaling_factor = scaling_factor, radius = 1) +
scale_fill_brewer(palette='Dark2')
Warning message in geom_node_point(aes(label = name), size = 0):
“Ignoring unknown aesthetics: label”
Joining with `by = join_by(name)`
The diversities category allows users to rapidly implement diversity metrics from the vegan package on any metadata in the seurat object. Diversity metrics help us get an idea of community composition in a category that has many members. We often use it to show if certain clusters are composed of only one or two samples. We can also use it to see how many celltypes are present in the lower-resolution nodes
srobj <- subclusteringTree(srobj, categories = c('sample','polyp','subset'),
counts = c('polyp','subset'),
diversities = c('sample','polyp','subset'),
diversity_metric = 'simpson')
Warning message: “Expected 10 pieces. Missing pieces filled with `NA` in 18036 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].” Joining with `by = join_by(name)`
Tracking loss of diversity along the tree can give an idea of which celltypes tend to have more sample or subset-specificity.
We can color both edges and nodes by their diversity as follows:
ggraph(srobj@misc$cluster_ggraph, layout = 'dendrogram') +
geom_edge_diagonal2(aes(color=node.sample_diversity)) + scale_edge_color_gradient(low='grey90',high='grey10') +
new_scale('color') +
geom_node_point(aes(color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10') +
theme_void() +
new_scale('color') +
geom_node_point(aes(color=subset_diversity,y=y+0.15),size=4,shape='square') +
scale_color_distiller(palette='Spectral') +
geom_node_text(aes(filter=leaf,label=subset_majority,color=subset_diversity,angle=90,hjust=1,y=y-0.2)) +
expand_limits(y=-3)
the final type of data we can add to trees is "totals" - these are numerical data that we want to sum up the tree - so instead of counting cells, we are counting some other numerical data. This type of tree propagation can be used to track any type of cell-level numerical data - and since we also track the number of children of each node, and the number of cells in each node, we can normalize these data to how many cells are present. The most obvious use for this is for QC metadata - we can sum up averages and divide by the number of child nodes, or we can sum up
srobj <- PercentageFeatureSet(srobj, pattern = '^MT-',col.name='percent.mt')
srobj <- subclusteringTree(srobj, categories = c('sample','polyp','subset'),
counts = 'polyp', diversities = c('sample','polyp'),
totals = c('nCount_RNA','percent.mt'),
diversity_metric = 'simpson')
Warning message: “Expected 10 pieces. Missing pieces filled with `NA` in 18036 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].” Joining with `by = join_by(name)`
even though we can only sum numerical attributes or count categories, we can extend ARBOL's simplifications of these things to statistically amended versions by normalizing at each node once we have our counts
srobj@misc$cluster_ggraph <- srobj@misc$cluster_ggraph %>% activate(nodes) %>%
mutate(normalized_RNA_count = nCount_RNA/n)
options(repr.plot.width=8.5,repr.plot.height=6.75)
ggraph(srobj@misc$cluster_ggraph, layout='tree', circular=T) +
geom_edge_diagonal2(aes(color=node.normalized_RNA_count)) + theme_void() +
scale_edge_colour_distiller(palette=4)
Already we can see there are some subclusters that have a greater average number of transcripts!
Let's try to extend this to quality by tracking mitochondria percentage
if we sum percent.mt up into the tree, we end up with nonsensical percentages in the internal nodes - but we can count how many child nodes each internal node has
srobj@misc$cluster_ggraph %>% activate(nodes) %>% data.frame %>% colnames
srobj@misc$cluster_ggraph <- srobj@misc$cluster_ggraph %>% activate(nodes) %>%
mutate(normalized.mt = percent.mt/n)
options(repr.plot.width=8.5,repr.plot.height=6.75)
ggraph(srobj@misc$cluster_ggraph, layout='tree', circular=T) +
geom_edge_diagonal2(aes(color=node.normalized.mt)) + theme_void() +
scale_edge_colour_gradient(low='#00D9C0',high="#FF4365", trans='log2')
Looks like there's some clades that have a higher than average mitochondrial percentage!
We can label them to take a closer look
ggraph(srobj@misc$cluster_ggraph, layout='tree', circular=T) +
geom_edge_diagonal2(aes(color=node.normalized.mt)) + theme_void() +
scale_edge_colour_gradient(low='#00D9C0',high="#FF4365", trans='log2') +
geom_node_text(aes(label=subset_majority,angle = node_angle(-x, -y)))
Looks like it's one of the clades of Apical cells - an end-stage epithelial celltype. I wonder if they're the diseased apical cells?
ggraph(srobj@misc$cluster_ggraph, layout='tree', circular=T) +
geom_edge_diagonal2(aes(color=node.normalized.mt)) + theme_void() +
scale_edge_colour_gradient(low='#00D9C0',high="#FF4365", trans='log2') +
geom_node_point(aes(color=polyp_n_YES/n)) + scale_color_viridis()
Interesting - its actually the non-polyp apical cells that have higher mitochondrial count
srobj <- suppressMessages(RunPCA(srobj))
library(tictoc)
tic()
srobj <- suppressWarnings(suppressMessages(ARBOLcentroidTaxonomy(srobj,
categories = c('polyp','subset'), diversities = c('sample','polyp','subset'),
counts = c('polyp','subset'), totals = c('percent.mt','nCount_RNA'),
tree_reduction='pca',centroid_method = 'mean', distance_method = cosine,
hclust_method='complete',nboot=100)))
toc()
Bootstrap (r = 0.48)... Done. Bootstrap (r = 0.6)... Done. Bootstrap (r = 0.68)... Done. Bootstrap (r = 0.8)... Done. Bootstrap (r = 0.88)... Done. Bootstrap (r = 1.0)... Done. Bootstrap (r = 1.08)... Done. Bootstrap (r = 1.2)... Done. Bootstrap (r = 1.28)... Done. Bootstrap (r = 1.4)... Done. 125.632 sec elapsed
options(repr.plot.width=16)
ggraph(srobj@misc$tax_ggraph, layout = 'dendrogram') +
geom_edge_diagonal2(aes(color=node.sample_diversity)) + scale_edge_color_gradient(low='grey90',high='grey10') +
new_scale('color') +
geom_node_point(aes(color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10') +
theme_void() +
new_scale('color') +
geom_node_point(aes(color=subset_diversity,y=y+0.3),size=4,shape='square') +
scale_color_distiller(palette='Spectral') +
geom_node_text(aes(filter=leaf,label=subset_majority,color=subset_diversity,angle=90,hjust=1,y=y-1)) +
expand_limits(y=-3)
Warning message: “Existing variables `leaf` overwritten by layout variables”
srobj@misc$tax_ggraph <- srobj@misc$tax_ggraph %>% activate(nodes) %>%
mutate(normalized.mt = percent.mt/n)
srobj@misc$tax_ggraph <- srobj@misc$tax_ggraph %>% activate(nodes) %>%
mutate(normalized_RNA_count = nCount_RNA/n)
options(repr.plot.width=8.5,repr.plot.height=6.75)
ggraph(srobj@misc$tax_ggraph, layout='tree', circular=T) +
geom_edge_diagonal2(aes(color=node.normalized_RNA_count)) + theme_void() +
scale_edge_colour_distiller(palette=4)
ggraph(srobj@misc$tax_ggraph, layout='tree', circular=T) +
geom_edge_diagonal2(aes(color=node.normalized.mt)) + theme_void() +
scale_edge_colour_gradient(low='#00D9C0',high="#FF4365", trans='log2') +
geom_node_point(aes(color=polyp_n_YES/n)) + scale_color_viridis()
#You can quickly list tree metrics as follows - but don't list the genes!
srobj@misc$tree_metrics[names(srobj@misc$tree_metrics)!='gene_list']
function (x)
{
x <- as.matrix(x)
y <- t(x) %*% x
res <- 1 - y/(sqrt(diag(y)) %*% t(sqrt(diag(y))))
res <- as.dist(res)
attr(res, "method") <- "cosine"
return(res)
}sessionInfo()
R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] tictoc_1.1 forcats_0.5.2 purrr_1.0.1 readr_2.1.3 [5] tidyr_1.3.0 tibble_3.1.8 tidyverse_1.3.2 ARBOL_4.0.0 [9] dendextend_1.16.0 cluster_2.1.3 strex_1.6.0 stringr_1.5.0 [13] glmGamPoi_1.10.2 ggnewscale_0.4.8 scatterpie_0.1.8 ggalluvial_0.12.3 [17] phyloseq_1.42.0 ape_5.6-2 data.tree_1.0.0 ggraph_2.1.0 [21] tidygraph_1.2.3 ggrepel_0.9.2 ggpubr_0.5.0 ggfortify_0.4.15 [25] reshape2_1.4.4 vegan_2.6-4 lattice_0.20-45 permute_0.9-7 [29] harmony_0.1.1 Rcpp_1.0.10 pvclust_2.2-0 dplyr_1.1.0 [33] ggplot2_3.4.0 limma_3.54.1 future_1.31.0 data.table_1.14.6 [37] SeuratObject_4.1.3 Seurat_4.3.0 Matrix.utils_0.9.7 Matrix_1.5-3 loaded via a namespace (and not attached): [1] utf8_1.2.3 spatstat.explore_3.0-6 [3] reticulate_1.28 tidyselect_1.2.0 [5] htmlwidgets_1.6.1 grid_4.2.1 [7] Rtsne_0.16 munsell_0.5.0 [9] codetools_0.2-18 ica_1.0-3 [11] pbdZMQ_0.3-7 miniUI_0.1.1.1 [13] withr_2.5.0 spatstat.random_3.1-3 [15] colorspace_2.1-0 progressr_0.13.0 [17] Biobase_2.58.0 uuid_1.1-0 [19] stats4_4.2.1 ROCR_1.0-11 [21] ggsignif_0.6.4 tensor_1.5 [23] listenv_0.9.0 labeling_0.4.2 [25] MatrixGenerics_1.10.0 repr_1.1.4 [27] GenomeInfoDbData_1.2.9 polyclip_1.10-4 [29] farver_2.1.1 rhdf5_2.42.0 [31] parallelly_1.34.0 vctrs_0.5.2 [33] generics_0.1.3 timechange_0.2.0 [35] R6_2.5.1 GenomeInfoDb_1.34.8 [37] graphlayouts_0.8.4 DelayedArray_0.24.0 [39] bitops_1.0-7 rhdf5filters_1.10.0 [41] spatstat.utils_3.0-1 assertthat_0.2.1 [43] promises_1.2.0.1 scales_1.2.1 [45] googlesheets4_1.0.1 gtable_0.3.1 [47] Cairo_1.6-0 globals_0.16.2 [49] goftest_1.2-3 rlang_1.0.6 [51] splines_4.2.1 rstatix_0.7.2 [53] lazyeval_0.2.2 gargle_1.2.1 [55] checkmate_2.1.0 spatstat.geom_3.0-6 [57] broom_1.0.3 modelr_0.1.10 [59] abind_1.4-5 backports_1.4.1 [61] httpuv_1.6.8 tools_4.2.1 [63] ellipsis_0.3.2 biomformat_1.26.0 [65] RColorBrewer_1.1-3 BiocGenerics_0.44.0 [67] ggridges_0.5.4 plyr_1.8.8 [69] base64enc_0.1-3 zlibbioc_1.44.0 [71] RCurl_1.98-1.10 deldir_1.0-6 [73] pbapply_1.7-0 viridis_0.6.2 [75] cowplot_1.1.1 S4Vectors_0.36.1 [77] zoo_1.8-11 haven_2.5.1 [79] SummarizedExperiment_1.28.0 grr_0.9.5 [81] fs_1.6.0 magrittr_2.0.3 [83] scattermore_0.8 reprex_2.0.2 [85] lmtest_0.9-40 RANN_2.6.1 [87] googledrive_2.0.0 fitdistrplus_1.1-8 [89] matrixStats_0.63.0 hms_1.1.2 [91] patchwork_1.1.2 mime_0.12 [93] evaluate_0.20 xtable_1.8-4 [95] readxl_1.4.1 IRanges_2.32.0 [97] gridExtra_2.3 compiler_4.2.1 [99] KernSmooth_2.23-20 crayon_1.5.2 [101] htmltools_0.5.4 tzdb_0.3.0 [103] ggfun_0.0.9 mgcv_1.8-40 [105] later_1.3.0 lubridate_1.9.1 [107] DBI_1.1.3 tweenr_2.0.2 [109] dbplyr_2.3.0 MASS_7.3-58.1 [111] ade4_1.7-20 car_3.1-1 [113] cli_3.6.0 parallel_4.2.1 [115] igraph_1.3.5 GenomicRanges_1.50.2 [117] pkgconfig_2.0.3 sp_1.6-0 [119] IRdisplay_1.1 plotly_4.10.1 [121] spatstat.sparse_3.0-0 xml2_1.3.3 [123] foreach_1.5.2 multtest_2.54.0 [125] XVector_0.38.0 rvest_1.0.3 [127] digest_0.6.31 sctransform_0.3.5 [129] RcppAnnoy_0.0.20 spatstat.data_3.0-0 [131] Biostrings_2.66.0 cellranger_1.1.0 [133] leiden_0.4.3 uwot_0.1.14 [135] shiny_1.7.4 lifecycle_1.0.3 [137] nlme_3.1-158 jsonlite_1.8.4 [139] Rhdf5lib_1.20.0 carData_3.0-5 [141] viridisLite_0.4.1 fansi_1.0.4 [143] pillar_1.8.1 fastmap_1.1.0 [145] httr_1.4.4 survival_3.3-1 [147] glue_1.6.2 png_0.1-8 [149] iterators_1.0.14 ggforce_0.4.1 [151] stringi_1.7.12 IRkernel_1.3 [153] irlba_2.3.5.1 future.apply_1.10.0